Thera Bank recently saw a steep decline in the number of its credit card users. Credit cards are a good source of income for banks because of the various fees they carry, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.
Customers leaving the credit card service would lead to a loss for the bank, so the bank wants to analyze customer data to identify which customers are likely to leave its credit card services and why, so that it can improve in those areas.
As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.
This is a commented Jupyter (IPython) notebook in which all the instructions and tasks to be performed are described.
# Installing the libraries with the specified version.
# uncomment and run the following line if Google Colab is being used
#!pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imbalanced-learn==0.10.1 xgboost==2.0.3 -q --user
# Installing the libraries with the specified version.
# uncomment and run the following lines if Jupyter Notebook is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imbalanced-learn==0.12.0 xgboost==2.0.3 -q --user
# !pip install --upgrade -q threadpoolctl
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
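After restarting the kernel, a quick sanity check can confirm the pinned versions were picked up. This is only a sketch; the names below are Python import names (e.g. `sklearn`), not the pip package names.

```python
import importlib

# Collect the installed version of each library used in this notebook.
versions = {}
for mod in ["sklearn", "seaborn", "matplotlib", "numpy", "pandas", "xgboost"]:
    try:
        versions[mod] = importlib.import_module(mod).__version__
    except ImportError:
        versions[mod] = None  # not installed in this environment

print(versions)
```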
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# To suppress scientific notations
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# To tune model, get different metric scores, and split data
from sklearn import metrics
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
)
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To impute missing values
from sklearn.impute import SimpleImputer
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# To do hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To help with model building
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
# Mounting Google Drive to access the dataset file saved there
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# Reading the dataset from Google Drive
churn = pd.read_csv('/content/drive/My Drive/BankChurners.csv')
The initial steps to get an overview of any dataset are to:
*Observe the first few rows of the dataset, to check whether the dataset has been loaded properly or not.
*Get information about the number of rows and columns in the dataset.
*Find out the data types of the columns, to ensure that data is stored in the preferred format and the value of each property is as expected.
*Check the statistical summary of the dataset, to get an overview of the numerical columns of the data.
# Checking the number of rows and columns in the training data
churn.shape
(10127, 21)
Observation - Dataset has 10127 rows & 21 columns.
# Creating a copy of the data to another variable to avoid any changes to original data
data = churn.copy()
# Viewing the first 5 rows of the dataset
data.head(5)
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
Observation - Shows top 5 rows of the data set.
# Viewing the last 5 rows of the dataset
data.tail(5)
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10122 | 772366833 | Existing Customer | 50 | M | 2 | Graduate | Single | $40K - $60K | Blue | 40 | 3 | 2 | 3 | 4003.000 | 1851 | 2152.000 | 0.703 | 15476 | 117 | 0.857 | 0.462 |
| 10123 | 710638233 | Attrited Customer | 41 | M | 2 | NaN | Divorced | $40K - $60K | Blue | 25 | 4 | 2 | 3 | 4277.000 | 2186 | 2091.000 | 0.804 | 8764 | 69 | 0.683 | 0.511 |
| 10124 | 716506083 | Attrited Customer | 44 | F | 1 | High School | Married | Less than $40K | Blue | 36 | 5 | 3 | 4 | 5409.000 | 0 | 5409.000 | 0.819 | 10291 | 60 | 0.818 | 0.000 |
| 10125 | 717406983 | Attrited Customer | 30 | M | 2 | Graduate | NaN | $40K - $60K | Blue | 36 | 4 | 3 | 3 | 5281.000 | 0 | 5281.000 | 0.535 | 8395 | 62 | 0.722 | 0.000 |
| 10126 | 714337233 | Attrited Customer | 43 | F | 2 | Graduate | Married | Less than $40K | Silver | 25 | 6 | 2 | 4 | 10388.000 | 1961 | 8427.000 | 0.703 | 10294 | 61 | 0.649 | 0.189 |
Observation - Shows last 5 rows of the data set.
# checking the data types of the columns in the dataset
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CLIENTNUM 10127 non-null int64 1 Attrition_Flag 10127 non-null object 2 Customer_Age 10127 non-null int64 3 Gender 10127 non-null object 4 Dependent_count 10127 non-null int64 5 Education_Level 8608 non-null object 6 Marital_Status 9378 non-null object 7 Income_Category 10127 non-null object 8 Card_Category 10127 non-null object 9 Months_on_book 10127 non-null int64 10 Total_Relationship_Count 10127 non-null int64 11 Months_Inactive_12_mon 10127 non-null int64 12 Contacts_Count_12_mon 10127 non-null int64 13 Credit_Limit 10127 non-null float64 14 Total_Revolving_Bal 10127 non-null int64 15 Avg_Open_To_Buy 10127 non-null float64 16 Total_Amt_Chng_Q4_Q1 10127 non-null float64 17 Total_Trans_Amt 10127 non-null int64 18 Total_Trans_Ct 10127 non-null int64 19 Total_Ct_Chng_Q4_Q1 10127 non-null float64 20 Avg_Utilization_Ratio 10127 non-null float64 dtypes: float64(5), int64(10), object(6) memory usage: 1.6+ MB
Observations -
*Education_Level (8608 non-null) and Marital_Status (9378 non-null) contain missing values; all other columns are complete.
*The dataset has 5 float columns, 10 integer columns, and 6 object (string) columns.
# Checking duplicate values in the dataset
data.duplicated().sum()
0
Observation - The data doesn't have any duplicate values.
# Check for missing values in the data -
data.isnull().sum()
CLIENTNUM 0 Attrition_Flag 0 Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 1519 Marital_Status 749 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
Observation - Education_Level has 1519 missing values and Marital_Status has 749; these will be imputed after the data is split, to avoid leakage.
#Checking statistical summary of the numerical columns in the data
data.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| CLIENTNUM | 10127.000 | 739177606.334 | 36903783.450 | 708082083.000 | 713036770.500 | 717926358.000 | 773143533.000 | 828343083.000 |
| Customer_Age | 10127.000 | 46.326 | 8.017 | 26.000 | 41.000 | 46.000 | 52.000 | 73.000 |
| Dependent_count | 10127.000 | 2.346 | 1.299 | 0.000 | 1.000 | 2.000 | 3.000 | 5.000 |
| Months_on_book | 10127.000 | 35.928 | 7.986 | 13.000 | 31.000 | 36.000 | 40.000 | 56.000 |
| Total_Relationship_Count | 10127.000 | 3.813 | 1.554 | 1.000 | 3.000 | 4.000 | 5.000 | 6.000 |
| Months_Inactive_12_mon | 10127.000 | 2.341 | 1.011 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Contacts_Count_12_mon | 10127.000 | 2.455 | 1.106 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Credit_Limit | 10127.000 | 8631.954 | 9088.777 | 1438.300 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.000 | 1162.814 | 814.987 | 0.000 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.000 | 7469.140 | 9090.685 | 3.000 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.000 | 0.760 | 0.219 | 0.000 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.000 | 4404.086 | 3397.129 | 510.000 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.000 | 64.859 | 23.473 | 10.000 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.000 | 0.712 | 0.238 | 0.000 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.000 | 0.275 | 0.276 | 0.000 | 0.023 | 0.176 | 0.503 | 0.999 |
Observations -
CLIENTNUM - This column can be dropped, as it is a unique ID for each customer and will not add anything to further analysis.
Customer_Age - The average customer age is 46 years, with a min of 26 and a max of 73 years.
Dependent_count - The average customer has 2 dependents, with a min of 0 and a max of 5 dependents.
Months_on_book - The average time on book is 35.9 months, with a min of 13 and a max of 56 months.
Total_Relationship_Count - The average customer holds almost 4 products with the bank, with a min of 1 and a max of 6 products.
Months_Inactive_12_mon - The average number of inactive months is 2.3, with a min of 0 and a max of 6 months.
Contacts_Count_12_mon - Customers have been contacted 2.4 times on average, with a min of 0 and a max of 6 contacts.
Credit_Limit - The average credit limit is 8632 dollars, with a min of 1438 and a max of 34516 dollars (rounded to the nearest dollar). This is a very large range.
Total_Revolving_Bal - The average revolving balance is 1163 dollars, with a min of 0 and a max of 2517.
Avg_Open_To_Buy - The average open-to-buy credit line is 7469 dollars, with a min of 3 and a max of 34516 dollars. This is a very large range.
Total_Amt_Chng_Q4_Q1 - The average amount change is 0.76, with a min of 0 and a max of 3.397. This is the ratio of the amount spent in Q4 to the amount spent in Q1 (Q4/Q1).
Total_Trans_Amt - The average transaction amount is 4404 dollars, with a min of 510 and a max of 18484 dollars.
Total_Trans_Ct - The average total transaction count is 64.8, with a min of 10 and a max of 139.
Total_Ct_Chng_Q4_Q1 - The average change in transaction count is 0.71, with a min of 0 and a max of 3.71. This is the ratio of the number of transactions in Q4 to the number in Q1 (Q4/Q1).
Avg_Utilization_Ratio - The average card utilization ratio is 27.5%, with a min of 0% and a max of 99.9%. This is the percentage of available credit the customer has used.
#Observing Object /categories from the dataset
data.describe(include=["object"]).T
| count | unique | top | freq | |
|---|---|---|---|---|
| Attrition_Flag | 10127 | 2 | Existing Customer | 8500 |
| Gender | 10127 | 2 | F | 5358 |
| Education_Level | 8608 | 6 | Graduate | 3128 |
| Marital_Status | 9378 | 3 | Married | 4687 |
| Income_Category | 10127 | 6 | Less than $40K | 3561 |
| Card_Category | 10127 | 4 | Blue | 9436 |
Observations
Attrition_Flag has 10127 non-null entries and 2 unique entries, with the most frequent being "Existing Customer".
Gender has 10127 non-null entries and 2 unique entries, with the most frequent being "F".
Education_Level has 8608 non-null entries and 6 unique entries, with the most frequent being "Graduate".
Null values are present and will be imputed after the data is split into training, validation, and test sets, to avoid data leakage.
Marital_Status has 9378 non-null entries and 3 unique entries, with the most frequent being "Married".
Null values are present and will be imputed after the data is split into training, validation, and test sets, to avoid data leakage.
Income_Category has 10127 non-null entries and 6 unique entries, with the most frequent being "Less than $40K".
Card_Category has 10127 non-null entries and 4 unique entries, with the most frequent being "Blue".
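Since the null values will be imputed only after the train/validation/test split, the leakage-safe pattern can be sketched on a toy frame (column names borrowed from this dataset; the split sizes here are illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy frame with the same kind of missing categoricals as the churn data.
toy = pd.DataFrame({
    "Education_Level": ["Graduate", np.nan, "High School", "Graduate", np.nan, "College"],
    "Marital_Status": ["Married", "Single", np.nan, "Married", "Single", "Married"],
})
train, test = train_test_split(toy, test_size=0.33, random_state=1)

imputer = SimpleImputer(strategy="most_frequent")
# Fit on the training split only, then reuse the learned modes on the test split,
# so no statistic from held-out rows leaks into the imputation.
train_imp = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)
test_imp = pd.DataFrame(imputer.transform(test), columns=test.columns)
```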
# Understanding Unique values for each Object column
for i in data.describe(include=["object"]).columns:
print("Unique values in", i, "are :")
print(data[i].value_counts())
print("*" * 50)
Unique values in Attrition_Flag are : Attrition_Flag Existing Customer 8500 Attrited Customer 1627 Name: count, dtype: int64 ************************************************** Unique values in Gender are : Gender F 5358 M 4769 Name: count, dtype: int64 ************************************************** Unique values in Education_Level are : Education_Level Graduate 3128 High School 2013 Uneducated 1487 College 1013 Post-Graduate 516 Doctorate 451 Name: count, dtype: int64 ************************************************** Unique values in Marital_Status are : Marital_Status Married 4687 Single 3943 Divorced 748 Name: count, dtype: int64 ************************************************** Unique values in Income_Category are : Income_Category Less than $40K 3561 $40K - $60K 1790 $80K - $120K 1535 $60K - $80K 1402 abc 1112 $120K + 727 Name: count, dtype: int64 ************************************************** Unique values in Card_Category are : Card_Category Blue 9436 Silver 555 Gold 116 Platinum 20 Name: count, dtype: int64 **************************************************
Observations - Income_Category shows 1112 values recorded as "abc"; these anomalous entries need to be treated in further analysis.
# CLIENTNUM consists of unique IDs of clients and hence, will not add any value to the modeling so dropping it.
data.drop(["CLIENTNUM"], axis=1, inplace=True)
## Encoding Existing and Attrited customers to 0 and 1 respectively, for analysis.
#Encoding "Attrition_Flag" data to 0 & 1, where 0 representing "Existing Customer" and 1 representing "Attrited Customer" for further analysis or modeling purposes.
data["Attrition_Flag"].replace("Existing Customer", 0, inplace=True)
data["Attrition_Flag"].replace("Attrited Customer", 1, inplace=True)
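Newer pandas versions warn about chained `replace(..., inplace=True)` calls on a column; an equivalent one-step encoding with `.map` (shown here on a toy frame) makes the 0/1 mapping explicit:

```python
import pandas as pd

# Toy frame standing in for the churn data.
toy = pd.DataFrame(
    {"Attrition_Flag": ["Existing Customer", "Attrited Customer", "Existing Customer"]}
)

# One explicit mapping instead of two inplace replaces.
toy["Attrition_Flag"] = toy["Attrition_Flag"].map(
    {"Existing Customer": 0, "Attrited Customer": 1}
)
```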
#Verifying Top 5 rows of the new data frame after replacement
data.head()
| Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 0 | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 0 | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 0 | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 0 | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
Observation - The Attrition_Flag column now shows 0/1 values instead of 'Existing Customer'/'Attrited Customer'; the first five rows are all existing customers, hence all zeros.
# Checking the values of the Income_Category column.
data['Income_Category'].value_counts()
Income_Category Less than $40K 3561 $40K - $60K 1790 $80K - $120K 1535 $60K - $80K 1402 abc 1112 $120K + 727 Name: count, dtype: int64
# Replacing "abc" entries in the Income_Category column with np.nan.
data['Income_Category'].replace('abc', np.nan, inplace=True)
# Checking the new values of the Income_Category column.
data['Income_Category'].value_counts()
Income_Category Less than $40K 3561 $40K - $60K 1790 $80K - $120K 1535 $60K - $80K 1402 $120K + 727 Name: count, dtype: int64
# Observing the amount of non-null values in the Income_Category column.
data['Income_Category'].info()
<class 'pandas.core.series.Series'> RangeIndex: 10127 entries, 0 to 10126 Series name: Income_Category Non-Null Count Dtype -------------- ----- 9015 non-null object dtypes: object(1) memory usage: 79.2+ KB
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to the show density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a triangle will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
plt.legend(loc="upper left", bbox_to_anchor=(1, 1))  # single legend, placed outside the axes
plt.show()
### Function to plot distributions
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
Customer_Age
histogram_boxplot(data, "Customer_Age", kde=True)
Observations
*The average customer age is 46 years; the data is approximately normally distributed.
*The plot shows a few outliers to the right for this variable.
Months_on_book
histogram_boxplot(data, "Months_on_book", kde=True)
Observations
*The 'Months on book' data looks normally distributed, with a pronounced spike at the mode.
*The data shows outliers, meaning there may be incorrect values.
*The average months on book is 35.9 months.
Credit_Limit
histogram_boxplot(data,"Credit_Limit", kde=True)
Observations
*The credit limit data is highly right skewed and shows many outliers, so it needs further analysis. These values appear to lie just outside the whisker range but may well be genuine credit limits.
*The average credit limit is approximately 8632 dollars.
*A cluster of points sits at the maximum credit limit of about 34516 dollars, which needs a closer look.
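The outliers that the boxplot flags follow the 1.5*IQR whisker rule, which can be applied directly to count them; a sketch on a few illustrative Credit_Limit-style values (not taken from the dataset):

```python
import pandas as pd

# Illustrative values only, spanning the observed Credit_Limit range.
s = pd.Series([1438, 2555, 4549, 11067, 34516, 3000, 5000, 2000, 1500, 4000])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
# Points beyond the whiskers (1.5 * IQR past the quartiles) count as outliers.
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(len(outliers), "outlier(s)")
```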
Total_Revolving_Bal
histogram_boxplot(data,"Total_Revolving_Bal",kde=True)
Observations
*Many customers show a zero 'Total_Revolving_Bal'; the data is slightly left skewed.
*The median 'Total_Revolving_Bal' is around 1276, with the mean (1163) slightly lower.
*The maximum total revolving balance is 2517.
Avg_Open_To_Buy
histogram_boxplot(data,"Avg_Open_To_Buy",kde=True)
Observations -
*The Avg_Open_To_Buy data is right skewed.
*There are many outliers on the right side of the data.
Total_Trans_Ct
histogram_boxplot(data,"Total_Trans_Ct",kde=True)
Observations -
*The Total_Trans_Ct data is fairly evenly distributed.
*There are only a few outliers, on the right.
Total_Amt_Chng_Q4_Q1
histogram_boxplot(data,"Total_Amt_Chng_Q4_Q1",kde=True)
Observations -
*The Total_Amt_Chng_Q4_Q1 data has a lot of outliers, which need to be treated.
*The data is fairly symmetric, with more outliers on the right side than on the left.
Let's see total transaction amount distributed
Total_Trans_Amt
histogram_boxplot(data,"Total_Trans_Amt",kde=True)
Observations -
*The Total_Trans_Amt data has many outliers on the right, so this column needs to be observed closely and the values treated.
*The data is unevenly distributed and slightly right skewed.
*No customer shows a transaction amount of 0, meaning all customers have used their credit cards.
Total_Ct_Chng_Q4_Q1
histogram_boxplot(data,"Total_Ct_Chng_Q4_Q1",kde=True)
Observations -
*The Total_Ct_Chng_Q4_Q1 column has many outliers, especially on the right side, which need treatment.
*The data distribution otherwise looks roughly normal.
Avg_Utilization_Ratio
histogram_boxplot(data,"Avg_Utilization_Ratio",kde=True)
Observations -
*The Avg_Utilization_Ratio data is highly right skewed.
*Most customers have a low average utilization ratio, i.e., they spend little of their available credit.
Dependent_count
labeled_barplot(data, "Dependent_count")
Observation
*Customers with a dependent count of 3 form the largest group.
*Customers with a dependent count of 5 form the smallest group.
*The most common dependent counts (1 through 4) each have more than 1500 customers.
Total_Relationship_Count
labeled_barplot(data,"Total_Relationship_Count")
Observations -
*The largest group of customers (2305) holds 3 products with Thera Bank.
*Very few customers hold only 1 product with the bank.
*Most customers hold 3 or more products with the bank.
Months_Inactive_12_mon
labeled_barplot(data,"Months_Inactive_12_mon")
Observation -
*The largest group of customers was inactive for 3 of the last 12 months.
*3282 customers were inactive for 2 months, and 2233 customers for 1 month.
Contacts_Count_12_mon
labeled_barplot(data,'Contacts_Count_12_mon')
Observations
*33% of customers have been contacted 3 times in the last 12 months.
*32% of customers have been contacted 2 times in the last 12 months.
*15% of customers have been contacted once in the last 12 months.
Gender
labeled_barplot(data,'Gender')
Observations - There are more female customers (5358) than male customers (4769).
Let's see the distribution of the level of education of customers
Education_Level
labeled_barplot(data,'Education_Level')
Observations -
The Education_Level column shows a mix of education levels among credit card holders. The largest group of customers are Graduates (3128), and there is a significant number of High School customers (2013). The smallest group (451) holds a Doctorate, and 516 customers are Post-Graduates. A significant number of customers (1487) are Uneducated.
Inference - Based on these observations, the High School and Uneducated customer categories should be studied as high-risk segments for credit card churn.
Marital_Status
labeled_barplot(data,'Marital_Status')
Observation - Most customers are Married (4687), though a significant number of card holders are Single (3943). There are relatively few Divorced customers (748).
Let's see the distribution of the level of income of customers
Income_Category
labeled_barplot(data,'Income_Category')
Observations - The largest group of customers (3561) falls in the 'Less than $40K' income bracket, though some customers earn beyond $120K. The 'abc' bar reflects anomalous data in this column, which needs further analysis.
Card_Category
labeled_barplot(data,'Card_Category')
Observations - Most customers hold the 'Blue' card category. A negligible number of customers hold Platinum cards, which could be removed from the dataset as they add little to the remaining data points. There is a major gap between Blue card holders and Silver card holders (555), and very few customers (116) hold Gold cards.
Attrition_Flag
labeled_barplot(data,'Attrition_Flag')
Observations -
*In Attrition_Flag, 0 reflects existing customers, i.e., those still with Thera Bank who have not attrited.
*A total of 1627 customers attrited recently.
*This is the target variable.
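The counts behind this bar plot (8500 existing vs. 1627 attrited, from the value_counts output earlier) imply a clear class imbalance; a quick sketch:

```python
import pandas as pd

# Class counts taken from the value_counts output above.
counts = pd.Series({0: 8500, 1: 1627}, name="Attrition_Flag")
attrition_rate = counts[1] / counts.sum()
print(f"Attrition rate: {attrition_rate:.1%}")
```

Roughly one in six customers attrited, which is why oversampling (SMOTE) and undersampling are imported at the top of the notebook.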
# creating histograms
data.hist(figsize=(14, 14))
plt.show()
Observations - All histograms were generated at once, to explore the overall distribution of each column in the dataset. Visualization parameters can be adjusted as needed, based on the characteristics of the data and further analysis objectives.
Attrition_Flag vs Gender
stacked_barplot(data, "Gender", "Attrition_Flag")
Attrition_Flag 0 1 All Gender All 8500 1627 10127 F 4428 930 5358 M 4072 697 4769 ------------------------------------------------------------------------------------------------------------------------
Observations - The plot shows a negligible difference between male and female customers in terms of attrition; both genders attrited at almost equal rates. The Gender column does not show much impact on the Attrition_Flag.
Attrition_Flag vs Marital_Status
stacked_barplot(data, "Marital_Status","Attrition_Flag")
Attrition_Flag 0 1 All Marital_Status All 7880 1498 9378 Married 3978 709 4687 Single 3275 668 3943 Divorced 627 121 748 ------------------------------------------------------------------------------------------------------------------------
Observations - Marital status also does not show any major impact on the attrition flag.
Education_Level Vs Attrition_Flag
stacked_barplot(data, 'Education_Level', 'Attrition_Flag')
Attrition_Flag 0 1 All Education_Level All 7237 1371 8608 Graduate 2641 487 3128 High School 1707 306 2013 Uneducated 1250 237 1487 College 859 154 1013 Doctorate 356 95 451 Post-Graduate 424 92 516 ------------------------------------------------------------------------------------------------------------------------
Observations - Education level shows no major impact on the attrition rate.
Attrition_Flag vs Income_Category
stacked_barplot(data,"Income_Category", "Attrition_Flag")
Attrition_Flag 0 1 All Income_Category All 7575 1440 9015 Less than $40K 2949 612 3561 $40K - $60K 1519 271 1790 $80K - $120K 1293 242 1535 $60K - $80K 1213 189 1402 $120K + 601 126 727 ------------------------------------------------------------------------------------------------------------------------
Observations - Income_Category does not show any significant impact on the attrition rate.
Attrition_Flag vs Contacts_Count_12_mon
stacked_barplot(data,"Contacts_Count_12_mon","Attrition_Flag")
Attrition_Flag 0 1 All Contacts_Count_12_mon All 8500 1627 10127 3 2699 681 3380 2 2824 403 3227 4 1077 315 1392 1 1391 108 1499 5 117 59 176 6 0 54 54 0 392 7 399 ------------------------------------------------------------------------------------------------------------------------
Observations -
Customers contacted more often in the last 12 months attrited at higher rates; notably, all 54 customers contacted 6 times attrited.
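Converting the crosstab counts printed above into per-level attrition rates makes the trend explicit; a sketch using those counts:

```python
import pandas as pd

# Counts copied from the crosstab printed above (columns: 0 = existing, 1 = attrited).
tab = pd.DataFrame(
    {0: [392, 1391, 2824, 2699, 1077, 117, 0], 1: [7, 108, 403, 681, 315, 59, 54]},
    index=[0, 1, 2, 3, 4, 5, 6],
)
tab.index.name = "Contacts_Count_12_mon"
# Attrited share of each contact-count level.
rate = tab[1] / tab.sum(axis=1)
print(rate.round(3))
```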
Let's see the number of months a customer was inactive in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)
Attrition_Flag vs Months_Inactive_12_mon
stacked_barplot(data,"Months_Inactive_12_mon", "Attrition_Flag")
Attrition_Flag 0 1 All Months_Inactive_12_mon All 8500 1627 10127 3 3020 826 3846 2 2777 505 3282 4 305 130 435 1 2133 100 2233 5 146 32 178 6 105 19 124 0 14 15 29 ------------------------------------------------------------------------------------------------------------------------
Observations - Months_Inactive_12_mon does have some effect on attrition.
Attrition_Flag vs Total_Relationship_Count
stacked_barplot(data,"Total_Relationship_Count", "Attrition_Flag")
Attrition_Flag 0 1 All Total_Relationship_Count All 8500 1627 10127 3 1905 400 2305 2 897 346 1243 1 677 233 910 5 1664 227 1891 4 1687 225 1912 6 1670 196 1866 ------------------------------------------------------------------------------------------------------------------------
Observations - Customers holding 1 or 2 products with the bank attrite the most, followed by those with 3 products. Customers with 4, 5, or 6 products attrite at nearly the same rates.
Attrition_Flag vs Dependent_count
stacked_barplot(data,"Dependent_count", "Attrition_Flag")
Attrition_Flag 0 1 All Dependent_count All 8500 1627 10127 3 2250 482 2732 2 2238 417 2655 1 1569 269 1838 4 1314 260 1574 0 769 135 904 5 360 64 424 ------------------------------------------------------------------------------------------------------------------------
Observation
This stacked barplot suggests Dependent_count has little effect on attrition.
Questions:
1. How is the total transaction amount distributed?
Total_Trans_Amt has many outliers on the right, so this column needs close observation and treatment. The data is unevenly distributed and slightly right-skewed. No customer shows a Total_Trans_Amt of 0, meaning every customer used their credit card. While the overall shape of the distribution is similar for both groups, attrited customers have a median Total_Trans_Amt of about $2,500, whereas existing customers show a higher median, nearing $4,000. The interquartile range (IQR) of Total_Trans_Amt for attrited customers is considerably narrower than for existing customers, and the maximum for attrited customers is roughly half that of existing customers.
2. What is the distribution of the level of education of customers?
The Education_Level column shows a mix of education levels among cardholders. Graduates are the largest group (3,128), and there is a significant number of High School customers (2,013). Doctorate holders are the smallest group (451), and there are 516 Post-Graduates. A significant number of customers (1,487) are Uneducated. Based on these observations, the High School and Uneducated categories should be studied as high-risk segments for credit card churn.
3. What is the distribution of the level of income of customers?
The largest group of customers (3,561) falls in the Less than $40K bracket, though some customers earn beyond $120K. About 39% of customers make less than $40K, 19% make $40K - $60K, and 17% make $80K - $120K.
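Percentages like these come from a normalized value count. A minimal sketch on made-up category shares (only the 39/19/17 figures echo the text above; the remaining shares are filler so the toy series sums to 100 customers):

```python
import pandas as pd

# Hypothetical income categories for 100 toy customers
income = pd.Series(
    ["Less than $40K"] * 39 + ["$40K - $60K"] * 19 + ["$80K - $120K"] * 17
    + ["$60K - $80K"] * 14 + ["$120K +"] * 11,
    name="Income_Category",
)
# normalize=True turns counts into proportions; *100 gives percentages
pct = income.value_counts(normalize=True) * 100
print(pct.round(1))
```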
4. How does the change in transaction amount between Q4 and Q1(total_ct_change_Q4_Q1) vary by the customer's account status (Attrition_Flag)?
The average Total_Ct_Chng_Q4_Q1 is 0.71, with a minimum of 0 and a maximum of 3.71. This is the ratio of the number of transactions in Q4 to the number in Q1 (Q4/Q1).
Total_Ct_Chng_Q4_Q1 is roughly normally distributed for both attrited and existing customers. The distribution is centered around 0.5 for attrited customers and around 0.7 for existing customers. The median for existing customers is greater than the 75th percentile for attrited customers, and the minimum for existing customers is much greater than that of attrited customers. The maximum looks the same for both groups.
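The per-class comparison of Total_Ct_Chng_Q4_Q1 can also be reproduced numerically with a grouped summary; a minimal sketch on a hypothetical miniature of the data (the values below are illustrative, not dataset figures):

```python
import pandas as pd

# Hypothetical miniature: Q4/Q1 transaction-count ratio per customer class
toy = pd.DataFrame({
    "Attrition_Flag": [0, 0, 0, 0, 1, 1, 1, 1],
    "Total_Ct_Chng_Q4_Q1": [0.7, 0.8, 0.6, 0.9, 0.4, 0.5, 0.5, 0.6],
})

# Per-class center and spread: existing (0) vs attrited (1) customers
summary = toy.groupby("Attrition_Flag")["Total_Ct_Chng_Q4_Q1"].agg(
    ["mean", "median", "min", "max"]
)
print(summary)
```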
5. How does the number of months a customer was inactive in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?
The average inactivity is 2.3 months, with a minimum of 0 and a maximum of 6 months.
The largest group of customers has been inactive for 3 months.
3,282 customers were inactive for 2 months and 2,233 for 1 month.
33% of customers have been contacted 3 times in the last 12 months, 31% have been contacted twice, and 14% have been contacted once.
6. What are the attributes that have a strong correlation with each other?
*Avg_Open_To_Buy and Credit_Limit are perfectly positively correlated by necessity: as a customer's credit limit goes up, their open-to-buy also increases.
*Total_Trans_Amt and Total_Trans_Ct are very highly positively correlated, which is natural: the more transactions a customer makes, the more money they spend.
*Customer_Age and Months_on_book are highly positively correlated: as customer age increases, their time with the bank increases.
*Total_Revolving_Bal and Avg_Utilization_Ratio are positively correlated: a customer with high utilization will likely carry a higher revolving balance.
*Avg_Open_To_Buy and Avg_Utilization_Ratio are negatively correlated: the higher the utilization, the less the amount open to buy.
*Credit_Limit and Avg_Utilization_Ratio are negatively correlated: customers with a higher credit limit tend to have lower utilization.
Total_Revolving_Bal vs Attrition_Flag
distribution_plot_wrt_target(data, "Total_Revolving_Bal", "Attrition_Flag")
Observations -
Total_Revolving_Bal shows broadly similar distributions for attrited and existing customers. Existing customers have a bulge in the center, while attrited customers have peaks at both the minimum and maximum of the distribution. The median Total_Revolving_Bal for existing customers is higher than that of attrited customers.
Attrition_Flag vs Credit_Limit
distribution_plot_wrt_target(data, "Credit_Limit", "Attrition_Flag")
Observations
The Credit_Limit plot shows an almost identical distribution for existing and attrited customers.
Attrition_Flag vs Customer_Age
distribution_plot_wrt_target(data, "Customer_Age", "Attrition_Flag")
Observations
The Customer_Age plot shows an almost identical distribution for existing and attrited customers.
Attrition_Flag vs Months_Inactive_12_mon
distribution_plot_wrt_target(data, "Months_Inactive_12_mon", "Attrition_Flag")
Observations
The average inactivity is about 2 months, with a minimum of 0 and a maximum of 6 months. The largest group of customers has been inactive for 3 months.
Total_Trans_Ct vs Attrition_Flag
distribution_plot_wrt_target(data, "Total_Trans_Ct", "Attrition_Flag")
Observations
Total_Trans_Ct is approximately normally distributed for attrited customers. Attrited customers have a much lower median and maximum Total_Trans_Ct than existing customers: the distribution is centered around 50 for attrited customers and around 70 for existing customers.
Total_Trans_Amt vs Attrition_Flag
distribution_plot_wrt_target(data, "Total_Trans_Amt", "Attrition_Flag")
Observations
The shape of the Total_Trans_Amt distribution looks similar for existing and attrited customers. The median for attrited customers is about 2,500, while for existing customers it is closer to 4,000. The IQR for attrited customers is much smaller than that of existing customers, and the maximum for attrited customers is about half that of existing customers.
Let's see how the change in transaction count between Q4 and Q1 (Total_Ct_Chng_Q4_Q1) varies by the customer's account status (Attrition_Flag)
Total_Ct_Chng_Q4_Q1 vs Attrition_Flag
distribution_plot_wrt_target(data, "Total_Ct_Chng_Q4_Q1", "Attrition_Flag")
Observations
Total_Ct_Chng_Q4_Q1 is roughly normally distributed for both attrited and existing customers. The distribution is centered around 0.5 for attrited customers and around 0.7 for existing customers. The median for existing customers is greater than the 75th percentile for attrited customers, and the minimum for existing customers is much greater than that of attrited customers. The maximum looks the same for both groups.
Avg_Utilization_Ratio vs Attrition_Flag
distribution_plot_wrt_target(data, "Avg_Utilization_Ratio", "Attrition_Flag")
Observations
The median Avg_Utilization_Ratio for attrited customers is 0%, while for existing customers it is about 20%. Close to 75% of attrited customers have an Avg_Utilization_Ratio less than the median of existing customers. This is consistent with the Total_Revolving_Bal plots above, where attrited customers peaked at the minimum.
Attrition_Flag vs Months_on_book
distribution_plot_wrt_target(data, "Months_on_book", "Attrition_Flag")
Observations
Months_on_book is roughly normally distributed for both existing and attrited customers.
Attrition_Flag vs Avg_Open_To_Buy
distribution_plot_wrt_target(data, "Avg_Open_To_Buy", "Attrition_Flag")
Observations
Avg_Open_To_Buy is nearly identically distributed for existing and attrited customers.
Let's see the attributes that have a strong correlation with each other
Correlation Check
#Creating correlation matrix to show any correlation between variables
plt.figure(figsize=(15, 7))
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
sns.heatmap(data[numeric_columns].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Observations
*Values near 1 indicate a strong positive correlation; values near -1 indicate a strong negative correlation.
*Avg_Open_To_Buy and Credit_Limit are perfectly positively correlated by necessity: as a customer's credit limit goes up, their open-to-buy also increases.
*Total_Trans_Amt and Total_Trans_Ct are very highly positively correlated, which is natural: the more transactions a customer makes, the more money they spend.
*Customer_Age and Months_on_book are highly positively correlated: as customer age increases, their time with the bank increases.
*Total_Revolving_Bal and Avg_Utilization_Ratio are positively correlated: a customer with high utilization will likely carry a higher revolving balance.
*Avg_Open_To_Buy and Avg_Utilization_Ratio are negatively correlated: the higher the utilization, the less the amount open to buy.
*Credit_Limit and Avg_Utilization_Ratio are negatively correlated: customers with a higher credit limit tend to have lower utilization.
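Rather than reading pairs off the heatmap, the strongest correlations can be ranked programmatically. A sketch on toy columns standing in for `data[numeric_columns]` (the near-duplicate relationship between Credit_Limit and Avg_Open_To_Buy is constructed deliberately; the column names are borrowed for illustration only):

```python
import numpy as np
import pandas as pd

# Toy numeric frame; Avg_Open_To_Buy is built as a near-copy of Credit_Limit
rng = np.random.default_rng(1)
limit = rng.uniform(1000, 30000, 200)
toy = pd.DataFrame({
    "Credit_Limit": limit,
    "Avg_Open_To_Buy": limit - rng.uniform(0, 500, 200),
    "Customer_Age": rng.uniform(25, 70, 200),
})

corr = toy.corr()
# Keep only the upper triangle so each pair appears once, then rank by |r|
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().sort_values(key=abs, ascending=False)
print(pairs)
```

On the real data this yields the same ranking described in the bullets above, with the Credit_Limit/Avg_Open_To_Buy pair at the top.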
Q1 = data.quantile(0.25,numeric_only=True) # To find the 25th percentile
Q3 = data.quantile(0.75, numeric_only=True) # To find the 75th percentile
IQR = Q3 - Q1 # Interquartile Range (75th percentile - 25th percentile)
# Finding lower and upper bounds for all values. All values outside these bounds are outliers
lower = (Q1 - 1.5 * IQR)
upper = (Q3 + 1.5 * IQR)
# checking the % outliers
((data.select_dtypes(include=["float64", "int64"]) < lower) | (data.select_dtypes(include=["float64", "int64"]) > upper)).sum() / len(data) * 100
Attrition_Flag             16.066
Customer_Age                0.020
Dependent_count             0.000
Months_on_book              3.812
Total_Relationship_Count    0.000
Months_Inactive_12_mon      3.268
Contacts_Count_12_mon       6.211
Credit_Limit                9.717
Total_Revolving_Bal         0.000
Avg_Open_To_Buy             9.509
Total_Amt_Chng_Q4_Q1        3.910
Total_Trans_Amt             8.848
Total_Trans_Ct              0.020
Total_Ct_Chng_Q4_Q1         3.891
Avg_Utilization_Ratio       0.000
dtype: float64
# creating the copy of the dataframe
data1 = data.copy()
data1.isna().sum()
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status               749
Income_Category             1112
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64
Observation - To avoid data leakage, the missing values in Education_Level (1,519), Marital_Status (749), and Income_Category (1,112) will be imputed after the train-test split.
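The leakage-safe pattern referred to here is: fit the imputer on the training split only, then transform the other splits without refitting. A minimal sketch on hypothetical toy splits (not the bank dataset):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy splits; the point is the fit/transform asymmetry
train_df = pd.DataFrame({"Education_Level": ["Graduate", "Graduate", np.nan, "College"]})
val_df = pd.DataFrame({"Education_Level": [np.nan, "Doctorate"]})

imputer = SimpleImputer(strategy="most_frequent")
# Learn the mode ("Graduate") from the TRAINING split only
train_df[["Education_Level"]] = imputer.fit_transform(train_df[["Education_Level"]])
# Apply it to validation/test without refitting, so no information leaks
val_df[["Education_Level"]] = imputer.transform(val_df[["Education_Level"]])
print(val_df["Education_Level"].tolist())  # ['Graduate', 'Doctorate']
```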
# Creating a list with column labels that need to be converted from "object" to "category" data type.
cat_cols = [
'Attrition_Flag',
'Gender',
'Education_Level',
'Marital_Status',
'Card_Category',
'Income_Category'
]
# Converting the columns with "object" data type to "category" data type.
data[cat_cols] = data[cat_cols].astype('category')
Observation Converted columns with data type of "object" to "category" for further use in model building & analysis.
# Verifying the data types after converting 'object' columns to 'category'.
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Attrition_Flag 10127 non-null category 1 Customer_Age 10127 non-null int64 2 Gender 10127 non-null category 3 Dependent_count 10127 non-null int64 4 Education_Level 8608 non-null category 5 Marital_Status 9378 non-null category 6 Income_Category 9015 non-null category 7 Card_Category 10127 non-null category 8 Months_on_book 10127 non-null int64 9 Total_Relationship_Count 10127 non-null int64 10 Months_Inactive_12_mon 10127 non-null int64 11 Contacts_Count_12_mon 10127 non-null int64 12 Credit_Limit 10127 non-null float64 13 Total_Revolving_Bal 10127 non-null int64 14 Avg_Open_To_Buy 10127 non-null float64 15 Total_Amt_Chng_Q4_Q1 10127 non-null float64 16 Total_Trans_Amt 10127 non-null int64 17 Total_Trans_Ct 10127 non-null int64 18 Total_Ct_Chng_Q4_Q1 10127 non-null float64 19 Avg_Utilization_Ratio 10127 non-null float64 dtypes: category(6), float64(5), int64(9) memory usage: 1.1 MB
Observations
'Attrition_Flag', 'Gender', 'Education_Level', 'Marital_Status', 'Card_Category', and 'Income_Category' were converted to the category dtype.
# Dividing train data into X and y
X = data1.drop(["Attrition_Flag"], axis=1)
y = data1["Attrition_Flag"]
# Splitting data into training, validation and test set
# first we split data into 2 parts, say temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.20, random_state=1, stratify=y
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 19) (2026, 19) (2026, 19)
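The 60/20/20 proportions follow from the arithmetic of the two-stage split: 20% is held out first, and 25% of the remaining 80% (0.25 × 0.80 = 0.20) becomes the validation set. A self-contained sketch with 100 toy rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)
y = np.array([1] * 16 + [0] * 84)  # roughly the notebook's 16% positive class

# Stage 1: hold out 20% as the test set, preserving class ratios
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=1, stratify=y
)
# Stage 2: 25% of the remaining 80% (= 20% of the total) becomes validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(len(X_train), len(X_val), len(X_test))  # prints: 60 20 20
```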
# creating an instance of the imputer to be used
imputer = SimpleImputer(strategy="most_frequent")
reqd_col_for_impute = ["Education_Level", "Marital_Status", "Income_Category"]
# Fitting and transforming the train data to impute missing values in X_train set
X_train[reqd_col_for_impute] = imputer.fit_transform(X_train[reqd_col_for_impute])
# Transforming the validation data with the imputer fitted on the train set (no refit, to avoid leakage)
X_val[reqd_col_for_impute] = imputer.transform(X_val[reqd_col_for_impute])
# Transforming the test data with the same fitted imputer
X_test[reqd_col_for_impute] = imputer.transform(X_test[reqd_col_for_impute])
# Verifying that no column has missing values in the train, validation, and test sets.
print(X_train.isna().sum())
print("*" * 40)
print(X_val.isna().sum())
print("*" * 40)
print(X_test.isna().sum())
print("*" * 40)
Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64 **************************************** Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64 **************************************** Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64 ****************************************
# Printing & verifying the size & percentages of classes of the Training, Validation, and Test data frames, after missing value imputation.
print("*"*40)
print("Shape of Training Set : ", X_train.shape)
print("Shape of Validation Set", X_val.shape)
print("Shape of Test Set : ", X_test.shape)
print("*"*40)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("*"*40)
print("Percentage of classes in validation set:")
print(y_val.value_counts(normalize=True))
print("*"*40)
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
print("*"*40)
****************************************
Shape of Training Set :  (6075, 19)
Shape of Validation Set (2026, 19)
Shape of Test Set :  (2026, 19)
****************************************
Percentage of classes in training set:
Attrition_Flag
0   0.839
1   0.161
Name: proportion, dtype: float64
****************************************
Percentage of classes in validation set:
Attrition_Flag
0   0.839
1   0.161
Name: proportion, dtype: float64
****************************************
Percentage of classes in test set:
Attrition_Flag
0   0.840
1   0.160
Name: proportion, dtype: float64
****************************************
Observations
*Split the data successfully into training, validation, and test sets.
*All models will be trained on the training data and evaluated on the validation data.
*The best model will be tuned and finally evaluated on the test data (a stand-in for production data).
cols = X_train.select_dtypes(include=["object", "category"])
for i in cols.columns:
print(X_train[i].value_counts())
print("*" * 40)
Gender F 3193 M 2882 Name: count, dtype: int64 **************************************** Education_Level Graduate 2782 High School 1228 Uneducated 881 College 618 Post-Graduate 312 Doctorate 254 Name: count, dtype: int64 **************************************** Marital_Status Married 3276 Single 2369 Divorced 430 Name: count, dtype: int64 **************************************** Income_Category Less than $40K 2783 $40K - $60K 1059 $80K - $120K 953 $60K - $80K 831 $120K + 449 Name: count, dtype: int64 **************************************** Card_Category Blue 5655 Silver 339 Gold 69 Platinum 12 Name: count, dtype: int64 ****************************************
cols = X_val.select_dtypes(include=["object", "category"])
for i in cols.columns:
print(X_val[i].value_counts())
print("*" * 40)
Gender F 1095 M 931 Name: count, dtype: int64 **************************************** Education_Level Graduate 917 High School 404 Uneducated 306 College 199 Post-Graduate 101 Doctorate 99 Name: count, dtype: int64 **************************************** Marital_Status Married 1100 Single 770 Divorced 156 Name: count, dtype: int64 **************************************** Income_Category Less than $40K 957 $40K - $60K 361 $80K - $120K 293 $60K - $80K 279 $120K + 136 Name: count, dtype: int64 **************************************** Card_Category Blue 1905 Silver 97 Gold 21 Platinum 3 Name: count, dtype: int64 ****************************************
cols = X_test.select_dtypes(include=["object", "category"])
for i in cols.columns:
print(X_test[i].value_counts())
print("*" * 40)
Gender F 1070 M 956 Name: count, dtype: int64 **************************************** Education_Level Graduate 948 High School 381 Uneducated 300 College 196 Post-Graduate 103 Doctorate 98 Name: count, dtype: int64 **************************************** Marital_Status Married 1060 Single 804 Divorced 162 Name: count, dtype: int64 **************************************** Income_Category Less than $40K 933 $40K - $60K 370 $60K - $80K 292 $80K - $120K 289 $120K + 142 Name: count, dtype: int64 **************************************** Card_Category Blue 1876 Silver 119 Gold 26 Platinum 5 Name: count, dtype: int64 ****************************************
# Using drop_first=True to avoid multicollinearity and reduce the size of the encoded data frame.
X_train = pd.get_dummies(X_train, drop_first=True)
X_val = pd.get_dummies(X_val,drop_first=True)
X_test = pd.get_dummies(X_test,drop_first=True)
# Printing shape of new dataframe
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 29) (2026, 29) (2026, 29)
Observations
*Encoded categorical columns for model building.
*Dropped the first dummy column from each category, since it is redundant given the remaining dummies.
*After encoding there are 29 columns (including dummies)
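One caveat with encoding each split separately: if a rare level (e.g. Platinum cards) happens to be absent from a split, `pd.get_dummies` produces mismatched columns across splits. A sketch of guarding against this by reindexing to the training columns (toy data, not the bank dataset):

```python
import pandas as pd

# Toy splits where the rare "Platinum" level never appears in validation
train = pd.DataFrame({"Card_Category": ["Blue", "Silver", "Platinum"]})
val = pd.DataFrame({"Card_Category": ["Blue", "Blue", "Silver"]})

train_d = pd.get_dummies(train, drop_first=True)
val_d = pd.get_dummies(val, drop_first=True)
print(list(val_d.columns))  # missing the Platinum dummy

# Align validation to the training columns; absent dummies become all-zero
val_d = val_d.reindex(columns=train_d.columns, fill_value=0)
print(list(val_d.columns))
```

Here the splits happen to produce identical columns (all levels occur in each), but the reindex guard makes that robust rather than lucky.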
# Checking information of new train data frame columns (29)
X_train.info()
<class 'pandas.core.frame.DataFrame'> Index: 6075 entries, 800 to 4035 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Customer_Age 6075 non-null int64 1 Dependent_count 6075 non-null int64 2 Months_on_book 6075 non-null int64 3 Total_Relationship_Count 6075 non-null int64 4 Months_Inactive_12_mon 6075 non-null int64 5 Contacts_Count_12_mon 6075 non-null int64 6 Credit_Limit 6075 non-null float64 7 Total_Revolving_Bal 6075 non-null int64 8 Avg_Open_To_Buy 6075 non-null float64 9 Total_Amt_Chng_Q4_Q1 6075 non-null float64 10 Total_Trans_Amt 6075 non-null int64 11 Total_Trans_Ct 6075 non-null int64 12 Total_Ct_Chng_Q4_Q1 6075 non-null float64 13 Avg_Utilization_Ratio 6075 non-null float64 14 Gender_M 6075 non-null bool 15 Education_Level_Doctorate 6075 non-null bool 16 Education_Level_Graduate 6075 non-null bool 17 Education_Level_High School 6075 non-null bool 18 Education_Level_Post-Graduate 6075 non-null bool 19 Education_Level_Uneducated 6075 non-null bool 20 Marital_Status_Married 6075 non-null bool 21 Marital_Status_Single 6075 non-null bool 22 Income_Category_$40K - $60K 6075 non-null bool 23 Income_Category_$60K - $80K 6075 non-null bool 24 Income_Category_$80K - $120K 6075 non-null bool 25 Income_Category_Less than $40K 6075 non-null bool 26 Card_Category_Gold 6075 non-null bool 27 Card_Category_Platinum 6075 non-null bool 28 Card_Category_Silver 6075 non-null bool dtypes: bool(15), float64(5), int64(9) memory usage: 800.9 KB
# Checking information of new validation set data frame's columns.
X_val.info()
<class 'pandas.core.frame.DataFrame'> Index: 2026 entries, 2894 to 6319 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Customer_Age 2026 non-null int64 1 Dependent_count 2026 non-null int64 2 Months_on_book 2026 non-null int64 3 Total_Relationship_Count 2026 non-null int64 4 Months_Inactive_12_mon 2026 non-null int64 5 Contacts_Count_12_mon 2026 non-null int64 6 Credit_Limit 2026 non-null float64 7 Total_Revolving_Bal 2026 non-null int64 8 Avg_Open_To_Buy 2026 non-null float64 9 Total_Amt_Chng_Q4_Q1 2026 non-null float64 10 Total_Trans_Amt 2026 non-null int64 11 Total_Trans_Ct 2026 non-null int64 12 Total_Ct_Chng_Q4_Q1 2026 non-null float64 13 Avg_Utilization_Ratio 2026 non-null float64 14 Gender_M 2026 non-null bool 15 Education_Level_Doctorate 2026 non-null bool 16 Education_Level_Graduate 2026 non-null bool 17 Education_Level_High School 2026 non-null bool 18 Education_Level_Post-Graduate 2026 non-null bool 19 Education_Level_Uneducated 2026 non-null bool 20 Marital_Status_Married 2026 non-null bool 21 Marital_Status_Single 2026 non-null bool 22 Income_Category_$40K - $60K 2026 non-null bool 23 Income_Category_$60K - $80K 2026 non-null bool 24 Income_Category_$80K - $120K 2026 non-null bool 25 Income_Category_Less than $40K 2026 non-null bool 26 Card_Category_Gold 2026 non-null bool 27 Card_Category_Platinum 2026 non-null bool 28 Card_Category_Silver 2026 non-null bool dtypes: bool(15), float64(5), int64(9) memory usage: 267.1 KB
# Checking information of new test data frame's columns.
X_test.info()
<class 'pandas.core.frame.DataFrame'> Index: 2026 entries, 9760 to 413 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Customer_Age 2026 non-null int64 1 Dependent_count 2026 non-null int64 2 Months_on_book 2026 non-null int64 3 Total_Relationship_Count 2026 non-null int64 4 Months_Inactive_12_mon 2026 non-null int64 5 Contacts_Count_12_mon 2026 non-null int64 6 Credit_Limit 2026 non-null float64 7 Total_Revolving_Bal 2026 non-null int64 8 Avg_Open_To_Buy 2026 non-null float64 9 Total_Amt_Chng_Q4_Q1 2026 non-null float64 10 Total_Trans_Amt 2026 non-null int64 11 Total_Trans_Ct 2026 non-null int64 12 Total_Ct_Chng_Q4_Q1 2026 non-null float64 13 Avg_Utilization_Ratio 2026 non-null float64 14 Gender_M 2026 non-null bool 15 Education_Level_Doctorate 2026 non-null bool 16 Education_Level_Graduate 2026 non-null bool 17 Education_Level_High School 2026 non-null bool 18 Education_Level_Post-Graduate 2026 non-null bool 19 Education_Level_Uneducated 2026 non-null bool 20 Marital_Status_Married 2026 non-null bool 21 Marital_Status_Single 2026 non-null bool 22 Income_Category_$40K - $60K 2026 non-null bool 23 Income_Category_$60K - $80K 2026 non-null bool 24 Income_Category_$80K - $120K 2026 non-null bool 25 Income_Category_Less than $40K 2026 non-null bool 26 Card_Category_Gold 2026 non-null bool 27 Card_Category_Platinum 2026 non-null bool 28 Card_Category_Silver 2026 non-null bool dtypes: bool(15), float64(5), int64(9) memory usage: 267.1 KB
The model can make two kinds of wrong predictions:
1. Predicting a customer will attrite when they will not (False Positive).
2. Predicting a customer will not attrite when they will (False Negative).
Which case is more important?
Losing an actual attriter (a False Negative) costs the bank a customer it could have tried to retain, so False Negatives are the costlier error.
How do we reduce this loss, i.e., reduce False Negatives?
-The bank should prioritize maximizing Recall, as a higher Recall reduces the likelihood of false negatives. This means focusing on increasing Recall or minimizing false negatives, effectively identifying true positives (i.e., Class 1). By accurately identifying customers at risk of attrition, the bank can better retain its valuable customers.
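The point about prioritizing Recall can be illustrated on a toy sample with the dataset's roughly 16% positive class, where accuracy flatters a useless model:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Toy sample: 16 attriters (class 1) among 100 customers
y_true = np.array([1] * 16 + [0] * 84)

# A useless model that predicts "no attrition" for everyone
y_naive = np.zeros(100, dtype=int)
print(accuracy_score(y_true, y_naive))  # 0.84 -- looks good but catches nobody
print(recall_score(y_true, y_naive))    # 0.0  -- misses every attriter

# A model that finds 12 of the 16 attriters, at the cost of 8 false alarms
y_model = np.concatenate([np.ones(12), np.zeros(4), np.ones(8), np.zeros(76)])
print(recall_score(y_true, y_model))    # 12/16 = 0.75
```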
Let's define a function to output different metrics (including recall) on the train and test data sets and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
# Defining a function to create a confusion matrix to check TP, FP, TN, and FN values.
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
# Getting Recall scores for 6 models that were fit on the original training data.
# Appending all the models into the list
models = []
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("Gradient Boost", GradientBoostingClassifier(random_state=1))) # Append Gradient Boosting
models.append(("AdaBoost", AdaBoostClassifier(random_state=1))) # Append AdaBoost
models.append(("Decisiontree",DecisionTreeClassifier(random_state=1))) # Append DecisionTree
models.append(("XGB", XGBClassifier(random_state=1)))#Append XG Boost
print("\n" "Training Performance:" "\n")
for name, model in models:
model.fit(X_train, y_train)
scores = recall_score(y_train, model.predict(X_train))
print("{}: {}".format(name, scores))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train, y_train)
scores_val = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores_val))
Training Performance:

Bagging: 0.985655737704918
Random forest: 1.0
Gradient Boost: 0.875
AdaBoost: 0.826844262295082
Decisiontree: 1.0
XGB: 1.0

Validation Performance:

Bagging: 0.8128834355828221
Random forest: 0.7975460122699386
Gradient Boost: 0.8558282208588958
AdaBoost: 0.852760736196319
Decisiontree: 0.8159509202453987
XGB: 0.901840490797546
#Getting Training and Validation Performance Difference
print("\nTraining and Validation Performance Difference:\n")
for name, model in models:
# Fit the model on the training data
model.fit(X_train, y_train)
# Calculate recall scores for training and validation sets
scores_train = recall_score(y_train, model.predict(X_train))
scores_val = recall_score(y_val, model.predict(X_val))
difference1 = scores_train - scores_val
print("{}: Training Score: {:.4f}, Validation Score: {:.4f}, Difference: {:.4f}".format(name, scores_train, scores_val, difference1))
Training and Validation Performance Difference:

Bagging: Training Score: 0.9857, Validation Score: 0.8129, Difference: 0.1728
Random forest: Training Score: 1.0000, Validation Score: 0.7975, Difference: 0.2025
Gradient Boost: Training Score: 0.8750, Validation Score: 0.8558, Difference: 0.0192
AdaBoost: Training Score: 0.8268, Validation Score: 0.8528, Difference: -0.0259
Decisiontree: Training Score: 1.0000, Validation Score: 0.8160, Difference: 0.1840
XGB: Training Score: 1.0000, Validation Score: 0.9018, Difference: 0.0982
Observations Model Building - Original Data
The top 3 models based on validation Recall and the train-validation difference are:
1. XGBoost (XGB)
2. Gradient Boost
3. AdaBoost
These models have the highest validation scores, and their performance differences indicate a balance between fitting the training data and generalizing to unseen data.
print("Before Oversampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Oversampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
sm = SMOTE(
sampling_strategy=1, k_neighbors=5, random_state=1
) # Synthetic Minority Over Sampling Technique
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print("After Oversampling, counts of label 'Yes': {}".format(sum(y_train_over == 1)))
print("After Oversampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))
print("After Oversampling, the shape of train_X: {}".format(X_train_over.shape))
print("After Oversampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before Oversampling, counts of label 'Yes': 976
Before Oversampling, counts of label 'No': 5099

After Oversampling, counts of label 'Yes': 5099
After Oversampling, counts of label 'No': 5099

After Oversampling, the shape of train_X: (10198, 29)
After Oversampling, the shape of train_y: (10198,)
# Getting Recall scores for 6 models that were fit on oversampled data.
# Appending all the models into the list
models = [] # Empty list to store all the models
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("Gradient Boost", GradientBoostingClassifier(random_state=1))) # Append Gradient Boosting
models.append(("AdaBoost", AdaBoostClassifier(random_state=1))) # Append AdaBoost
models.append(("Decisiontree",DecisionTreeClassifier(random_state=1))) # Append DecisionTree
models.append(("XGB", XGBClassifier(random_state=1)))#Append XG Boost
print("\n" "Training Performance:" "\n")
for name, model in models:
# Fit the model on the training data
model.fit(X_train_over, y_train_over)
scores = recall_score(y_train_over, model.predict(X_train_over))
print("{}: {}".format(name, scores))
print("\n" "Validation Performance:" "\n")
for name, model in models:
model.fit(X_train_over, y_train_over)
scores_val = recall_score(y_val, model.predict(X_val))
print("{}: {}".format(name, scores_val))
Training Performance:

Bagging: 0.9976465973720338
Random forest: 1.0
Gradient Boost: 0.9792116101196313
AdaBoost: 0.964698960580506
Decisiontree: 1.0
XGB: 1.0

Validation Performance:

Bagging: 0.8619631901840491
Random forest: 0.8619631901840491
Gradient Boost: 0.9049079754601227
AdaBoost: 0.901840490797546
Decisiontree: 0.8650306748466258
XGB: 0.9294478527607362
# Getting the training and validation performance difference on oversampled data
print("\nTraining and Validation Performance Difference:\n")
for name, model in models:
    # Models are already fitted on the oversampled training data
    scores_train = recall_score(y_train_over, model.predict(X_train_over))
    scores_val = recall_score(y_val, model.predict(X_val))
    difference2 = scores_train - scores_val
    print("{}: Training Score: {:.4f}, Validation Score: {:.4f}, Difference: {:.4f}".format(name, scores_train, scores_val, difference2))
Training and Validation Performance Difference:

Bagging: Training Score: 0.9976, Validation Score: 0.8620, Difference: 0.1357
Random forest: Training Score: 1.0000, Validation Score: 0.8620, Difference: 0.1380
Gradient Boost: Training Score: 0.9792, Validation Score: 0.9049, Difference: 0.0743
AdaBoost: Training Score: 0.9647, Validation Score: 0.9018, Difference: 0.0629
Decisiontree: Training Score: 1.0000, Validation Score: 0.8650, Difference: 0.1350
XGB: Training Score: 1.0000, Validation Score: 0.9294, Difference: 0.0706
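The fit-and-score pattern above is repeated for training recall, validation recall, and the gap between them, refitting every model on each pass. A small helper can fit each model once and report all three numbers together. The sketch below runs on synthetic data; `recall_report` and the demo dataset are illustrative, not part of the notebook's pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def recall_report(models, X_tr, y_tr, X_v, y_v):
    """Fit each (name, model) pair once; return (train recall, val recall, gap)."""
    report = {}
    for name, model in models:
        model.fit(X_tr, y_tr)
        train_rec = recall_score(y_tr, model.predict(X_tr))
        val_rec = recall_score(y_v, model.predict(X_v))
        report[name] = (train_rec, val_rec, train_rec - val_rec)
    return report

# Demo on a small synthetic stand-in for the oversampled training set
X, y = make_classification(n_samples=400, weights=[0.8, 0.2], random_state=1)
X_tr, X_v, y_tr, y_v = train_test_split(X, y, stratify=y, random_state=1)
report = recall_report(
    [("Tree", DecisionTreeClassifier(random_state=1)),
     ("Forest", RandomForestClassifier(random_state=1))],
    X_tr, y_tr, X_v, y_v,
)
for name, (train_rec, val_rec, gap) in report.items():
    print("{}: train={:.4f}, val={:.4f}, gap={:.4f}".format(name, train_rec, val_rec, gap))
```

In the notebook, the same helper could be called with the six models and (X_train_over, y_train_over, X_val, y_val) to produce all three printouts from a single round of fitting.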
Observations: Model Building with Oversampled Data
The top 3 models based on the validation recall scores and the train-validation differences are:
1. XGBoost (XGB)
2. Gradient Boost
3. AdaBoost
These models have the highest validation recall and relatively small gaps between training and validation scores, indicating a good balance between fitting the training data and generalizing to unseen data.
rus = RandomUnderSampler(random_state=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, counts of label 'Yes': {}".format(sum(y_train == 1)))
print("Before Under Sampling, counts of label 'No': {} \n".format(sum(y_train == 0)))
print("After Under Sampling, counts of label 'Yes': {}".format(sum(y_train_un == 1)))
print("After Under Sampling, counts of label 'No': {} \n".format(sum(y_train_un == 0)))
print("After Under Sampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Under Sampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Under Sampling, counts of label 'Yes': 976
Before Under Sampling, counts of label 'No': 5099

After Under Sampling, counts of label 'Yes': 976
After Under Sampling, counts of label 'No': 976

After Under Sampling, the shape of train_X: (1952, 29)
After Under Sampling, the shape of train_y: (1952,)
# Getting Recall scores for 6 models that were fit on undersampled data.
# Appending all the models into the list
models = []
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1)))
models.append(("Gradient Boost", GradientBoostingClassifier(random_state=1))) # Append Gradient Boosting
models.append(("AdaBoost", AdaBoostClassifier(random_state=1))) # Append AdaBoost
models.append(("Decisiontree", DecisionTreeClassifier(random_state=1)))  # Append Decision Tree
models.append(("XGB", XGBClassifier(random_state=1)))  # Append XGBoost
print("\nTraining Performance:\n")
for name, model in models:
    # Fit the model on the undersampled training data
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_train_un, model.predict(X_train_un))
    print("{}: {}".format(name, scores))

print("\nValidation Performance:\n")
for name, model in models:
    # Models are already fitted above; evaluate them on the validation set
    scores_val = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores_val))
Training Performance:

Bagging: 0.9907786885245902
Random forest: 1.0
Gradient Boost: 0.9805327868852459
AdaBoost: 0.9528688524590164
Decisiontree: 1.0
XGB: 1.0

Validation Performance:

Bagging: 0.9294478527607362
Random forest: 0.9386503067484663
Gradient Boost: 0.9570552147239264
AdaBoost: 0.9601226993865031
Decisiontree: 0.9202453987730062
XGB: 0.9693251533742331
# Getting the training and validation performance difference on undersampled data
print("\nTraining and Validation Performance Difference:\n")
for name, model in models:
    # Models are already fitted on the undersampled training data
    scores_train = recall_score(y_train_un, model.predict(X_train_un))
    scores_val = recall_score(y_val, model.predict(X_val))
    difference3 = scores_train - scores_val
    print("{}: Training Score: {:.4f}, Validation Score: {:.4f}, Difference: {:.4f}".format(name, scores_train, scores_val, difference3))
Training and Validation Performance Difference:

Bagging: Training Score: 0.9908, Validation Score: 0.9294, Difference: 0.0613
Random forest: Training Score: 1.0000, Validation Score: 0.9387, Difference: 0.0613
Gradient Boost: Training Score: 0.9805, Validation Score: 0.9571, Difference: 0.0235
AdaBoost: Training Score: 0.9529, Validation Score: 0.9601, Difference: -0.0073
Decisiontree: Training Score: 1.0000, Validation Score: 0.9202, Difference: 0.0798
XGB: Training Score: 1.0000, Validation Score: 0.9693, Difference: 0.0307
Observations: Model Building with Undersampled Data
The top 3 models based on the validation recall scores and the train-validation differences are:
1. XGBoost (XGB)
2. AdaBoost
3. Gradient Boost
These models have the highest validation recall and the smallest gaps between training and validation scores, indicating a strong balance between fitting the training data and generalizing to unseen data.
#Getting Best Parameters,CV score using RandomizedSearchCV
%%time
# defining model
Model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"base_estimator": [
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)  # fit on the original (imbalanced) training data
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 100, 'learning_rate': 0.1, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.8360596546310832:
CPU times: user 3.98 s, sys: 270 ms, total: 4.25 s
Wall time: 1min 47s
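Beyond `best_params_`, `RandomizedSearchCV.cv_results_` records the mean and standard deviation of the CV score for every sampled combination, which shows how close the runners-up were to the winner. A minimal, self-contained sketch on synthetic data (the dataset and the small grid here are illustrative stand-ins for the notebook's search):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=1)

search = RandomizedSearchCV(
    AdaBoostClassifier(random_state=1),
    param_distributions={
        "n_estimators": np.arange(50, 110, 25),
        "learning_rate": [0.01, 0.05, 0.1],
    },
    n_iter=5, scoring="recall", cv=3, random_state=1, n_jobs=-1,
)
search.fit(X, y)

# One row per sampled combination, sorted so the best-ranked comes first
cv_df = (
    pd.DataFrame(search.cv_results_)
    .sort_values("rank_test_score")
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
)
print(cv_df.head())
```

A large `std_test_score` on the top-ranked row would suggest the winning combination is less stable across folds than its mean alone implies.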
# Building AdaBoost with the best parameters found above
tuned_adb_orig = AdaBoostClassifier(
    random_state=1,
    n_estimators=100,
    learning_rate=0.1,
    base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
tuned_adb_orig.fit(X_train, y_train)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                         random_state=1),
                   learning_rate=0.1, n_estimators=100, random_state=1)
adb_train_orig = model_performance_classification_sklearn(tuned_adb_orig, X_train, y_train)
adb_train_orig
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.982 | 0.927 | 0.961 | 0.944 |
# Saving the tuned model's scores for later comparison.
adb_train_orig_score = model_performance_classification_sklearn(tuned_adb_orig, X_train, y_train)
# Creating the confusion matrix for the tuned model's performance on the original training data.
confusion_matrix_sklearn(tuned_adb_orig, X_train, y_train)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - AdaBoost Original (train data)")
plt.show()
# Checking model's performance on validation set
adb_val_orig = model_performance_classification_sklearn(tuned_adb_orig, X_val, y_val)
adb_val_orig
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.967 | 0.856 | 0.933 | 0.893 |
# Saving the tuned model's scores for later comparison.
adb_val_orig_score = model_performance_classification_sklearn(tuned_adb_orig, X_val, y_val)
# Creating the confusion matrix for the tuned model's performance on the original validation data.
confusion_matrix_sklearn(tuned_adb_orig, X_val, y_val)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - AdaBoost Original (Validation data)")
plt.show()
%%time
# defining model
Model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(50, 110, 25),
    "learning_rate": [0.01, 0.1, 0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 100, 'learning_rate': 0.05, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.9467346938775512:
CPU times: user 1.84 s, sys: 119 ms, total: 1.96 s
Wall time: 41.1 s
# Building AdaBoost with the best parameters found above
tuned_ada_un = AdaBoostClassifier(
    random_state=1,
    n_estimators=100,
    learning_rate=0.05,
    base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
tuned_ada_un.fit(X_train_un, y_train_un)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                         random_state=1),
                   learning_rate=0.05, n_estimators=100, random_state=1)
adb_train_un = model_performance_classification_sklearn(tuned_ada_un, X_train_un, y_train_un)
adb_train_un
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.973 | 0.978 | 0.968 | 0.973 |
#Saving the tuned model's scores for later comparison.
adb_train_un_score = model_performance_classification_sklearn(tuned_ada_un, X_train_un, y_train_un)
# Creating the confusion matrix for the tuned model's performance on the undersampled training data.
confusion_matrix_sklearn(tuned_ada_un, X_train_un, y_train_un)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - AdaBoost Undersample (train data)")
plt.show()
# Checking model's performance on validation set
adb_val_un = model_performance_classification_sklearn(tuned_ada_un, X_val, y_val)
adb_val_un
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.937 | 0.966 | 0.731 | 0.832 |
# Creating the confusion matrix for the tuned model's performance on the validation data.
confusion_matrix_sklearn(tuned_ada_un, X_val, y_val)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - AdaBoost Undersample (Validation data)")
plt.show()
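The undersampled AdaBoost trades precision (0.731) for recall (0.966): many non-churners get flagged. When false positives carry a real cost, the decision threshold on the predicted churn probability can be tuned instead of accepting the default 0.5. A sketch on synthetic data, using a GradientBoostingClassifier as a stand-in for the tuned model (the dataset and thresholds are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=1)
X_tr, X_v, y_tr, y_v = train_test_split(X, y, stratify=y, random_state=1)

clf = GradientBoostingClassifier(random_state=1).fit(X_tr, y_tr)
proba = clf.predict_proba(X_v)[:, 1]  # predicted probability of the positive class

# Lower thresholds buy recall at the cost of precision; higher thresholds do the reverse
for threshold in (0.3, 0.5, 0.7):
    pred = (proba >= threshold).astype(int)
    print("threshold={}: recall={:.3f}, precision={:.3f}".format(
        threshold,
        recall_score(y_v, pred),
        precision_score(y_v, pred, zero_division=0),
    ))
```

Recall is monotonically non-increasing in the threshold, so sweeping a grid of thresholds gives a full menu of recall-precision trade-offs to choose from.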
%%time
# defining model
Model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(50, 110, 25),
    "learning_rate": [0.01, 0.1, 0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over, y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 100, 'learning_rate': 0.1, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.9515668956493293:
CPU times: user 6.72 s, sys: 338 ms, total: 7.06 s
Wall time: 2min 52s
# Building AdaBoost with the best parameters found above
tuned_ada_over = AdaBoostClassifier(
    random_state=1,
    n_estimators=100,
    learning_rate=0.1,
    base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
tuned_ada_over.fit( X_train_over, y_train_over)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                         random_state=1),
                   learning_rate=0.1, n_estimators=100, random_state=1)
adb_train_over = model_performance_classification_sklearn(tuned_ada_over, X_train_over, y_train_over)
adb_train_over
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.985 | 0.985 | 0.985 | 0.985 |
#Saving the tuned model's scores for later comparison.
adb_train_over_score = model_performance_classification_sklearn(tuned_ada_over, X_train_over, y_train_over)
# Creating the confusion matrix for the tuned model's performance on the oversampled training data.
confusion_matrix_sklearn(tuned_ada_over, X_train_over, y_train_over)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - AdaBoost Oversample (train data)")
plt.show()
# Checking model's performance on validation set
adb_val_over = model_performance_classification_sklearn(tuned_ada_over, X_val, y_val)
adb_val_over
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.968 | 0.908 | 0.894 | 0.901 |
# Creating the confusion matrix for the tuned model's performance on the validation data.
confusion_matrix_sklearn(tuned_ada_over, X_val, y_val)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - AdaBoost Oversample (Validation data)")
plt.show()
%%time
#defining model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "init": [AdaBoostClassifier(random_state=1), DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50, 110, 25),
    "learning_rate": [0.01, 0.1, 0.05],
    "subsample": [0.7, 0.9],
    "max_features": [0.5, 0.7, 1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'n_estimators': 100, 'max_features': 0.5, 'learning_rate': 0.1, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.8104395604395604:
CPU times: user 3.9 s, sys: 412 ms, total: 4.32 s
Wall time: 2min 41s
# Building Gradient Boosting with the best parameters found above
tuned_gbm_orig = GradientBoostingClassifier(
max_features=0.5,
init=AdaBoostClassifier(random_state=1),
random_state=1,
learning_rate=0.1,
n_estimators= 100,
subsample=0.9,
)
tuned_gbm_orig.fit(X_train, y_train)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.5, random_state=1, subsample=0.9)
gbm_train_orig = model_performance_classification_sklearn(tuned_gbm_orig, X_train, y_train)
gbm_train_orig
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.972 | 0.867 | 0.955 | 0.909 |
# Saving the tuned model's scores for later comparison.
gbm_train_orig_score = model_performance_classification_sklearn(tuned_gbm_orig, X_train, y_train)
# Creating the confusion matrix for the tuned model's performance on the original training data.
confusion_matrix_sklearn(tuned_gbm_orig, X_train, y_train)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - GradientBoost Original (train data)")
plt.show()
# Checking model's performance on validation set
gbm_val_orig = model_performance_classification_sklearn(tuned_gbm_orig, X_val, y_val)
gbm_val_orig
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.968 | 0.862 | 0.937 | 0.898 |
# Creating the confusion matrix for the tuned model's performance on the original validation data.
confusion_matrix_sklearn(tuned_gbm_orig, X_val, y_val)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - GradientBoost Original (Validation data)")
plt.show()
%%time
# Defining model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "init": [AdaBoostClassifier(random_state=1), DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50, 110, 25),
    "learning_rate": [0.01, 0.1, 0.05],
    "subsample": [0.7, 0.9],
    "max_features": [0.5, 0.7, 1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'n_estimators': 75, 'max_features': 0.7, 'learning_rate': 0.1, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.9508267922553637:
CPU times: user 2.16 s, sys: 184 ms, total: 2.35 s
Wall time: 1min 13s
# Building Gradient Boosting with the best parameters found above
tuned_gbm_un = GradientBoostingClassifier(
max_features=0.7,
init=AdaBoostClassifier(random_state=1),
random_state=1,
learning_rate=0.1,
n_estimators=75,
subsample=0.9,
)
tuned_gbm_un.fit(X_train_un, y_train_un)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.7, n_estimators=75, random_state=1,
                           subsample=0.9)
gbm_train_un = model_performance_classification_sklearn(tuned_gbm_un, X_train_un, y_train_un)
gbm_train_un
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.970 | 0.977 | 0.964 | 0.970 |
# Saving the tuned model's scores for later comparison.
gbm_train_un_score = model_performance_classification_sklearn(tuned_gbm_un, X_train_un, y_train_un)
# Creating the confusion matrix for the tuned model's performance on the undersampled training data.
confusion_matrix_sklearn(tuned_gbm_un, X_train_un, y_train_un)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - GradientBoost Undersample (train data)")
plt.show()
gbm_val_un = model_performance_classification_sklearn(tuned_gbm_un, X_val, y_val)
gbm_val_un
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.938 | 0.957 | 0.738 | 0.833 |
# Creating the confusion matrix for the tuned model's performance on the validation data.
confusion_matrix_sklearn(tuned_gbm_un, X_val, y_val)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - GradientBoost Undersample (Validation data)")
plt.show()
%%time
# Defining model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "init": [AdaBoostClassifier(random_state=1), DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50, 110, 25),
    "learning_rate": [0.01, 0.1, 0.05],
    "subsample": [0.7, 0.9],
    "max_features": [0.5, 0.7, 1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'n_estimators': 100, 'max_features': 0.5, 'learning_rate': 0.1, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.9541157228347668:
CPU times: user 5.97 s, sys: 632 ms, total: 6.6 s
Wall time: 4min 23s
# Building Gradient Boosting with the best parameters found above
tuned_gbm_over = GradientBoostingClassifier(
max_features=0.5,
init=AdaBoostClassifier(random_state=1),
random_state=1,
learning_rate=0.1,
n_estimators=100,
subsample=0.9,
)
tuned_gbm_over.fit(X_train_over, y_train_over)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.5, random_state=1, subsample=0.9)
gbm_train_over = model_performance_classification_sklearn(tuned_gbm_over, X_train_over, y_train_over)
gbm_train_over
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.975 | 0.979 | 0.972 | 0.975 |
# Saving the tuned model's scores for later comparison.
gbm_train_over_score = model_performance_classification_sklearn(tuned_gbm_over, X_train_over, y_train_over)
# Creating the confusion matrix for the tuned model's performance on the oversampled training data.
confusion_matrix_sklearn(tuned_gbm_over, X_train_over, y_train_over)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - GradientBoost Oversample (train data)")
plt.show()
gbm_val_over = model_performance_classification_sklearn(tuned_gbm_over, X_val, y_val)
gbm_val_over
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.961 | 0.911 | 0.853 | 0.881 |
# Creating the confusion matrix for the tuned model's performance on the validation data.
confusion_matrix_sklearn(tuned_gbm_over, X_val, y_val)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - GradientBoost Oversample (Validation data)")
plt.show()
%%time
# defining model
Model = XGBClassifier(random_state=1,eval_metric='logloss')
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(50, 110, 25),
    "scale_pos_weight": [1, 2, 5],
    "learning_rate": [0.01, 0.1, 0.05],
    "gamma": [1, 3],
    "subsample": [0.7, 0.9],
}
from sklearn import metrics
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'scale_pos_weight': 5, 'n_estimators': 100, 'learning_rate': 0.1, 'gamma': 3} with CV score=0.921098901098901:
CPU times: user 2.12 s, sys: 195 ms, total: 2.32 s
Wall time: 1min 5s
tuned_xgb_orig = XGBClassifier(
random_state=1,
eval_metric="logloss",
subsample=0.9,
scale_pos_weight=5,
n_estimators=100,
learning_rate=0.1,
gamma=3,
)
tuned_xgb_orig.fit(X_train, y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=3, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.1, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=100,
              n_jobs=None, num_parallel_tree=None, random_state=1, ...)
xgb_train_orig = model_performance_classification_sklearn(tuned_xgb_orig, X_train, y_train)
xgb_train_orig
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.988 | 1.000 | 0.932 | 0.965 |
# Saving the tuned model's scores for later comparison.
xgb_train_orig_score = model_performance_classification_sklearn(tuned_xgb_orig, X_train, y_train)
# Creating the confusion matrix for the tuned model's performance on the original training data.
confusion_matrix_sklearn(tuned_xgb_orig, X_train, y_train)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - XGBoost Original (train data)")
plt.show()
xgb_val_orig = model_performance_classification_sklearn(tuned_xgb_orig, X_val, y_val)
xgb_val_orig
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.965 | 0.942 | 0.855 | 0.896 |
# Creating the confusion matrix for the tuned model's performance on the original validation data.
confusion_matrix_sklearn(tuned_xgb_orig, X_val, y_val)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - XGBoost Original (validation data)")
plt.show()
%%time
# defining model
Model = XGBClassifier(random_state=1,eval_metric='logloss')
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(50, 110, 25),
    "scale_pos_weight": [1, 2, 5],
    "learning_rate": [0.01, 0.1, 0.05],
    "gamma": [1, 3],
    "subsample": [0.7, 0.9],
}
from sklearn import metrics
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7, 'scale_pos_weight': 5, 'n_estimators': 50, 'learning_rate': 0.01, 'gamma': 3} with CV score=0.9979591836734695:
CPU times: user 1.61 s, sys: 109 ms, total: 1.72 s
Wall time: 38.4 s
tuned_xgb_un = XGBClassifier(
random_state=1,
eval_metric="logloss",
subsample=0.7,
scale_pos_weight=5,
n_estimators=100,
learning_rate=0.01,
gamma=3,
)
tuned_xgb_un.fit(X_train_un, y_train_un)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=3, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.01, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=100,
              n_jobs=None, num_parallel_tree=None, random_state=1, ...)
xgb_train_un = model_performance_classification_sklearn(tuned_xgb_un, X_train, y_train)
xgb_train_un
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.779 | 1.000 | 0.421 | 0.593 |
# Saving the tuned model's scores for later comparison.
xgb_train_un_score = model_performance_classification_sklearn(tuned_xgb_un, X_train, y_train)
# Creating the confusion matrix for the undersample-trained model's performance on the full training data.
confusion_matrix_sklearn(tuned_xgb_un, X_train, y_train)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - XGBoost Undersample (train data)")
plt.show()
xgb_val_un = model_performance_classification_sklearn(tuned_xgb_un, X_val, y_val)
xgb_val_un
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.777 | 0.994 | 0.419 | 0.590 |
# Creating the confusion matrix for the tuned model's performance on the validation data.
confusion_matrix_sklearn(tuned_xgb_un, X_val, y_val)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - XGBoost Undersample (validation data)")
plt.show()
%%time
# defining model
Model = XGBClassifier(random_state=1,eval_metric='logloss')
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
    "n_estimators": np.arange(50, 110, 25),
    "scale_pos_weight": [1, 2, 5],
    "learning_rate": [0.01, 0.1, 0.05],
    "gamma": [1, 3],
    "subsample": [0.7, 0.9],
}
from sklearn import metrics
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, n_jobs = -1, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7, 'scale_pos_weight': 5, 'n_estimators': 50, 'learning_rate': 0.01, 'gamma': 3} with CV score=0.9994117647058823:
CPU times: user 2.57 s, sys: 285 ms, total: 2.86 s
Wall time: 1min 25s
tuned_xgb_over = XGBClassifier(
random_state=1,
eval_metric="logloss",
subsample=0.7,
scale_pos_weight=5,
n_estimators=50,
learning_rate=0.01,
gamma=3,
)
tuned_xgb_over.fit(X_train_over, y_train_over)
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=3, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.01, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=50,
              n_jobs=None, num_parallel_tree=None, random_state=1, ...)
xgb_train_over = model_performance_classification_sklearn(tuned_xgb_over, X_train_over, y_train_over)
xgb_train_over
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.792 | 1.000 | 0.706 | 0.828 |
# Evaluating the tuned model on the original (non-oversampled) training data for reference
xgb_train_over_score = model_performance_classification_sklearn(tuned_xgb_over, X_train, y_train)
# Creating the confusion matrix for the tuned model on the oversampled training data
confusion_matrix_sklearn(tuned_xgb_over, X_train_over, y_train_over)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - XGBoost Oversampled (train data)")
plt.show()
xgb_val_over = model_performance_classification_sklearn(tuned_xgb_over, X_val, y_val)
xgb_val_over
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.655 | 1.000 | 0.318 | 0.483 |
# Creating the confusion matrix for the tuned (oversampled-data) model's performance on the validation data
confusion_matrix_sklearn(tuned_xgb_over, X_val, y_val)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - XGBoost Oversampled (validation data)")
plt.show()
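The `confusion_matrix_sklearn` helper is defined earlier in the notebook; a minimal sketch of what such a helper might look like is below (the actual implementation may differ — the toy data and `plot_confusion` name are illustrative):

```python
# Sketch of a confusion_matrix_sklearn-style helper: predict, tabulate,
# and render predicted-vs-actual counts as a heatmap.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs in scripts
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

def plot_confusion(model, X, y):
    """Plot the confusion matrix for model's predictions on (X, y) and return it."""
    cm = confusion_matrix(y, model.predict(X))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
    return cm

# Toy data stands in for the bank's features.
X = np.array([[0.0], [0.2], [0.8], [1.0]])
y = np.array([0, 0, 1, 1])
cm = plot_confusion(LogisticRegression().fit(X, y), X, y)
print(cm.sum())  # total count equals the number of samples, 4
```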
# training performance comparison
models_train_comp_df = pd.concat(
    [
        xgb_train_orig.T,
        gbm_train_orig.T,
        adb_train_orig.T,
        xgb_train_over.T,
        gbm_train_over.T,
        adb_train_over.T,
        xgb_train_un.T,
        gbm_train_un.T,
        adb_train_un.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "XGBoost trained with Original data",
    "Gradient boosting trained with Original data",
    "AdaBoost trained with Original data",
    "XGBoost trained with Oversampled data",
    "Gradient boosting trained with Oversampled data",
    "AdaBoost trained with Oversampled data",
    "XGBoost trained with Undersampled data",
    "Gradient boosting trained with Undersampled data",
    "AdaBoost trained with Undersampled data",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
|   | XGBoost trained with Original data | Gradient boosting trained with Original data | AdaBoost trained with Original data | XGBoost trained with Oversampled data | Gradient boosting trained with Oversampled data | AdaBoost trained with Oversampled data | XGBoost trained with Undersampled data | Gradient boosting trained with Undersampled data | AdaBoost trained with Undersampled data |
|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.988 | 0.972 | 0.982 | 0.792 | 0.975 | 0.985 | 0.779 | 0.970 | 0.973 |
| Recall | 1.000 | 0.867 | 0.927 | 1.000 | 0.979 | 0.985 | 1.000 | 0.977 | 0.978 |
| Precision | 0.932 | 0.955 | 0.961 | 0.706 | 0.972 | 0.985 | 0.421 | 0.964 | 0.968 |
| F1 | 0.965 | 0.909 | 0.944 | 0.828 | 0.975 | 0.985 | 0.593 | 0.970 | 0.973 |
# validation performance comparison
models_val_comp_df = pd.concat(
    [
        xgb_val_orig.T,
        gbm_val_orig.T,
        adb_val_orig.T,
        xgb_val_over.T,
        gbm_val_over.T,
        adb_val_over.T,
        xgb_val_un.T,
        gbm_val_un.T,
        adb_val_un.T,
    ],
    axis=1,
)
models_val_comp_df.columns = [
    "XGBoost Validation with Original data",
    "Gradient boosting Validation with Original data",
    "AdaBoost Validation with Original data",
    "XGBoost Validation with Oversampled data",
    "Gradient boosting Validation with Oversampled data",
    "AdaBoost Validation with Oversampled data",
    "XGBoost Validation with Undersampled data",
    "Gradient boosting Validation with Undersampled data",
    "AdaBoost Validation with Undersampled data",
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
|   | XGBoost Validation with Original data | Gradient boosting Validation with Original data | AdaBoost Validation with Original data | XGBoost Validation with Oversampled data | Gradient boosting Validation with Oversampled data | AdaBoost Validation with Oversampled data | XGBoost Validation with Undersampled data | Gradient boosting Validation with Undersampled data | AdaBoost Validation with Undersampled data |
|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.988 | 0.972 | 0.982 | 0.792 | 0.975 | 0.985 | 0.779 | 0.970 | 0.973 |
| Recall | 1.000 | 0.867 | 0.927 | 1.000 | 0.979 | 0.985 | 1.000 | 0.977 | 0.978 |
| Precision | 0.932 | 0.955 | 0.961 | 0.706 | 0.972 | 0.985 | 0.421 | 0.964 | 0.968 |
| F1 | 0.965 | 0.909 | 0.944 | 0.828 | 0.975 | 0.985 | 0.593 | 0.970 | 0.973 |
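The selection below weighs several metrics at once; picking by a single metric can be done mechanically with `idxmax` on a row of the comparison frame. A sketch with a frame shaped like `models_val_comp_df` (the two columns and their numbers are toy values for illustration):

```python
# Sketch: selecting the best model column per metric from a
# metrics-by-model comparison frame.
import pandas as pd

comp = pd.DataFrame(
    {
        "XGBoost (oversampled)": [0.792, 1.000, 0.706, 0.828],
        "AdaBoost (oversampled)": [0.985, 0.985, 0.985, 0.985],
    },
    index=["Accuracy", "Recall", "Precision", "F1"],
)

# Recall alone would pick XGBoost; F1 would pick AdaBoost.
print(comp.loc["Recall"].idxmax())  # XGBoost (oversampled)
print(comp.loc["F1"].idxmax())      # AdaBoost (oversampled)
```

This illustrates why the analysis below balances recall against precision and F1 instead of maximizing recall in isolation.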
Best Model Selection Analysis

1. AdaBoost (Oversampled Data):
AdaBoost trained with oversampled data maintains high recall (0.985) alongside equally high precision (0.985) and accuracy (0.985), suggesting a well-balanced model.
It shows the highest and most balanced performance across all key metrics (Accuracy, Recall, Precision, F1) on the validation dataset, indicating strong generalization and robustness and making it the ideal candidate for further tuning.

2. AdaBoost (Undersampled Data):
High and consistent performance, with Accuracy 0.973, Recall 0.978, Precision 0.968, and F1 0.973.
It performs consistently across all key metrics on both the training and validation datasets, making it a robust and reliable alternative.

3. XGBoost (Original Data):
XGBoost with the original data also performs exceptionally well, with perfect recall and a high F1 score.
However, its precision of 0.932 suggests it favors recall at the cost of precision, a potential sign of overfitting.

Model Selection Conclusion:
Balancing recall against overall performance to avoid overfitting, the best model is AdaBoost trained with oversampled data, which achieves a recall of 0.985 while maintaining strong performance across the other metrics.

Observation:
final_model = AdaBoost trained with oversampled data; it is the best candidate for final evaluation.
Now we have our final model, so let's find out how our final model is performing on unseen test data.
# Let's check the performance on test set
final_model = tuned_ada_over
Model_test = model_performance_classification_sklearn(final_model, X_test, y_test)
Model_test
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.967 | 0.929 | 0.873 | 0.900 |
# Plot the confusion matrix on the test set
confusion_matrix_sklearn(final_model, X_test, y_test)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - AdaBoost (Test Set)")
plt.show()
# Let's identify the important features
feature_names = X_train.columns
importances = final_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
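The bar chart above can also be expressed as a sorted Series, which makes the top-N list easy to extract programmatically. A sketch with illustrative feature names and importance values (not the model's actual numbers):

```python
# Sketch: ranking feature importances as a sorted Series instead of a bar chart.
import numpy as np
import pandas as pd

names = ["Total_Trans_Amt", "Total_Trans_Ct", "Customer_Age"]
importances = np.array([0.40, 0.35, 0.25])  # illustrative values

ranked = pd.Series(importances, index=names).sort_values(ascending=False)
print(ranked.index[0])  # the single most important feature: Total_Trans_Amt
```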
Observations:
Top 6 important features of the data set are:
Total_Trans_Amt
Total_Trans_Ct
Total_Revolving_Bal
Total_Amt_Chng_Q4_Q1
Total_Ct_Chng_Q4_Q1
Total_Relationship_Count
Key Attrition Drivers:
Total_Trans_Ct: Customers with fewer transactions are more prone to attrition. Incentives can encourage card usage.
Total_Revolving_Bal: Extreme revolving balances contribute to higher attrition rates. Helping customers manage these balances can mitigate risk.
Total_Amt_Chng_Q4_Q1: A large change in transaction amount between Q4 and Q1 can signal attrition risk.
Total_Ct_Chng_Q4_Q1: A sharp drop in transaction count between Q4 and Q1 likewise points to disengagement.
Total_Relationship_Count: Customers holding fewer bank products are more likely to attrit. Enhancing product offerings and investigating issues related to product usage can aid retention.
Additional Insights:
Proactive Customer Retention: reach out to customers whose transaction counts or amounts are declining before they churn.
Targeted Marketing: tailor offers to the segments the model flags as high attrition risk.
Boost Engagement: use incentives such as rewards to encourage more frequent card usage.
Cross-Selling: increase the number of products per customer, since a higher relationship count is associated with lower attrition.
Retention Initiatives: design programs for customers carrying extreme revolving balances.
Proactive Monitoring: track quarter-over-quarter changes in spending (Total_Amt_Chng_Q4_Q1, Total_Ct_Chng_Q4_Q1) as early-warning signals.
By focusing on enhancing customer engagement and monitoring transaction behaviors, Thera Bank can better manage customer retention and strengthen its financial performance.